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Identifying genes indispensable for an organism's life and their 
characteristics is one of the central questions in current 
biological research, and hence it would be helpful to develop 
computational approaches towards the prediction of essential 
genes. The performance of a predictor is usually measured by 
the area under the receiver operating characteristic curve 
(AUC). We propose a novel method by implementing genetic 
algorithms to maximize the partial AUC that is restricted to a 
specific interval of lower false positive rate (FPR), the region 
relevant to follow-up experimental validation. Our predictor 
uses various features based on sequence information, protein- 
protein interaction network topology, and gene expression 
profiles. A feature selection wrapper was developed to 
alleviate the over-fitting problem and to weigh each feature's 
relevance to prediction. We evaluated our method using the 
proteome of budding yeast. Our implementation of genetic 
algorithms maximizing the partial AUC below 0.05 or 0.10 of 
FPR outperformed other popular classification methods. [BMB 
Reports 2013; 46(1): 41-46] 



INTRODUCTION 

Identifying genes indispensable for an organism's life, as well as 
identifying the characteristics of those genes, is one of the cen- 
tral questions in biology. The essential genes of a number of or- 
ganisms have been identified and catalogued through ge- 
nome-wide knock-out experiments (1). However, identification 
of essential genes by such wet experiments is time-consuming 
and labor-intensive. Hence, it would be helpful to develop com- 
putational approaches to essential gene prediction for candidate 
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screening. For example, computational prediction of essential 
genes in pathogenic microbes has an implication in prioritizing 
antimicrobial drug targets (2). In addition to the candidate list, 
computational predictions can identify sequence features and 
related properties that are relevant to essentiality. 

Computationally, predicting essential genes amounts to a 
problem of classifying all the genes into binary classes of essen- 
tial and nonessential ones based on a number of features that 
can be calculated for each gene. These features must have some 
relevance to essentiality. Previous studies have used various 
types of features, such as comparative genomic properties, se- 
quence-based features, and functional genomic properties. It is 
well established that essential genes are more conserved than 
nonessential ones (3). Phyletic retention, the number of organ- 
isms in which an ortholog is conserved, has been used widely 
for the prediction of essential genes (4). Besides comparative ge- 
nomic features, a number of sequence-based features that can be 
directly calculated within the sequenced genome have been em- 
ployed in predicting essential genes (4, 5). For a newly se- 
quenced genome, only the sequence-based features are typi- 
cally available, and thus, essentiality prediction solely based on 
these features has important applications. 

Proliferation of high-throughput functional genomics data 
provided opportunities to explore their properties in relation to 
essentiality. For example, a protein-protein interaction network 
is crucial for most biological processes. Essential genes tend to 
play special roles, like hubs in the network, and thus, theirtopo- 
logical properties have been used to predict gene essentiality 
(6-8). In fact, Hwang et a/, combined the protein-protein inter- 
action network properties with the sequence features in predict- 
ing essential genes, and demonstrated that the essentiality pre- 
diction improved by adding the network properties (9). Gene ex- 
pression is also a fundamental process tightly regulated at a sys- 
tem level, and is related to gene essentiality. It has been shown 
that gene expression variation is related to gene essentiality (7, 
10, 11). 

We integrated all the features, based on sequence information 
or protein-protein interaction network topology, which have 
been used by other methods. In addition, we included gene ex- 
pression properties such as transcriptional activity and variation, 
as well as phyletic retention. While the inclusion of more fea- 
tures would improve the prediction power, some irrelevant or 
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distracting features might confuse the machine learning system 
or cause over-fitting. Pruning such features would improve the 
performance of the predictor, and would help to assess the rela- 
tive importance of each feature in essentiality prediction. For 
this, various techniques have been used. For example, the corre- 
lation coefficient of each feature with essentiality, and the naive 
Bayes technique were used to assess the relative importance of 
each feature as an essentiality predictor (4, 12). Wilcoxon rank 
sum tests were used to evaluate the statistical difference of essen- 
tial and nonessential genes in each feature (9). The receiver op- 
erating characteristic (ROC) curves were also evaluated for in- 
dividual features to assess their performances (9). We employed 
a more rigorous and robust method that avoids over-fitting. 
Using backward greedy search elimination, features were se- 
lected based on the ROC evaluation, followed by 10-fold 
cross-validation. 

The area under ROC curve (a.k.a., AUC) is a common per- 
formance measure to evaluate a classification method. When 
applied to the binary classification problem, the conventional 
full AUC maximization concerns the whole range of true pos- 
itive rate and false positive rate (FPR). For the practical purpose 
of prioritizing candidates that would undergo experimental vali- 
dation, it is important to produce a short list that is depleted of 
false positives. Seringhaus et a/. (4) approached this problem by 
assigning relatively higher costs to false positives than to false 
negatives. Here, we propose partial AUC maximization that fo- 
cuses on only the top scoring candidates, rather than the entire 
set. Partial AUC has been used in medical diagnostic screening, 
where a high true-positive rate to the fixed lower FPR is pref- 
erable (1 3). We developed a method that maximized the partial 
AUC directly at a given FPR of either 0.05 or 0.1 usingagenetic 
algorithm. It is one of the heuristic search techniques that have 
been inspired by evolutionary biology, and has been used wide- 
ly to solve complicated problems in various fields. We evaluated 
our method using the proteome of Saccharomyces cerevisiae, 
for which a comprehensive set of experimentally validated es- 
sential genes is available. In addition, high throughput pro- 
tein-protein interaction data and gene expression profiles of this 
model organism are readily available from DIP (14) and GEO 
(15), respectively. We made comparisons with other machine 
learning algorithms. 

RESULTS 

We compiled a total of 31 features for 5825 S. cerevisiae genes 
from various sources. After removing the genes with missing val- 
ues, the final dataset comprised of 3979 genes (including 940 es- 
sential genes). Ten-fold cross validation was used on this dataset 
for performance comparison. 

Features relevant to gene essentiality 

Pruning irrelevant or distracting features is an important step in a 
machine learning system. By maintaining only highly-relevant 
features, a simple and fast model can be built. We applied a fea- 



ture selection wrapper to the training set of each fold. For each 
fold, the feature selection procedure was repeated 50 times to re- 
duce variability (see MATERIALS and METHODS). In doing so, 
one would identify the features that are consistently selected. 
These would be the features that are the most relevant to 
essentiality. Table 1 summarizes the number of times each fea- 
ture has been selected in each fold. 

Among the 31 features, the following five have been selected 
50 times in every fold: nucleus, phyletic retention, essentiality 
index (El), clique level (KL), and variance of gene expression. 
Our findings on these highly important features for essentiality 
are concordant with previous studies as follows. First, essential 
genes tend to be localized in the nucleus, where essential bio- 
logical processes such as information storage and processing 
take place (4, 1 6). Second, essential genes are known to be more 
evolutionarily-conserved than are nonessential genes in bacteria 
(3) as well as some eukaryotes (1 7). In this regard, Gustafson et 
a/. (5) showed that phyletic retention, a surrogate for evolu- 
tionary conservation, is highly correlated with gene essentiality 
in yeast and f. coli. Third, essential genes are more likely to in- 
teract with other essential genes than nonessential genes. For ex- 
ample, Hwang et al. (9) showed that essential genes have higher 
values of essentiality index (the proportion of essential genes 
among the interacting partners) than nonessential genes. Fourth, 
clique level (the size of the largest cliques that a protein belongs 
to in a protein-protein interaction network) is related to the con- 
nectedness of a gene, which is known to be related with gene es- 
sentiality (18). It outperformed other network-based features in 
Hwang et a/.'s study (9). Finally, variance in gene expression is 
known to be negatively correlated with gene essentiality and 
evolutionary conservation rate in eukaryotes, including yeast (7, 
10). 

Besides transcriptional variability, transcriptional activity or 
gene expression level has been observed to being related to es- 
sentiality, in that highly expressed genes evolve slowly from bac- 
teria to mammals (10). In our study, mean gene expression was 
selected as one of the essentiality features most of the time 
(>90% in average). Three other sequence-based features, i.e., 
A3s, Gravy, and CLOSE STOP RATIO, were also chosen very 
often. Gravy, the general average hydropathicity, and 
TM HELIX, the number of transmembrane helices, have been 
shown to be negatively correlated to essentiality (4). They are 
correlated to each other as the hydropathy is one of the im- 
portant factors contributing to the prediction of transmembrane 
proteins. Many transmembrane proteins function as transporters 
and participate in metabolic or nonessential roles. In yeast, high- 
ly expressed (more likely essential than nonessential) genes pre- 
fer pyrimidine bases (C or T) at the third synonymous codon po- 
sition but do not prefer codons with A+T richness, which is a 
characteristics of the yeast genome (38% G + C content) (19). 
The fact that A3s, the frequency of A at the synonymous third po- 
sition of codons, was relatively frequently chosen in our feature 
selection experiments may then be due to its high frequency in 
nonessential genes. If the essential genes in yeast prefer pyr- 
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imidine bases at the third synonymous codon position, one 
would expect that C3s and T3s, the frequencies of C and T, re- 
spectively, at the third synonymous codon position, have similar 
prediction powers. Indeed, the number of times C3s was se- 
lected showed a high correlation with that of T3s (C.C. = 0.88), 

Table 1. The number of times 3 each feature has been selected in each fold 



while that of G3s showed a negative correlation (C.C. = —0.78) 
in each fold of the feature selection experiments (Table 1). 

Prediction performance on gene essentiality 

Genetic algorithms with different objective functions: In each 



Features 



Fold number 
5 6 



Average 



10 



Cytosol 1 

Extracellular 1 

Plasma membrane 1 

Mitochondria 1 

Nucleus 1 

TM HELIX 2 

T3s 3 

C3s 3 

A3s 3 

G3s 3 

CAI 3 

CBI 3 

Fop 3 

Nc 3 

GC3s 3 

GC 3 

L_sym 3 

L aa 

Gravy 3 

Aromo 3 

RAREAMI NOACI D 4 

CLOSE STOP RAIO 4 

Phyletic retention 5 

Degree K 6 

CCo 6 

BC 6 

cc 6 

KL 6 
El 6 

Mean 
Variance 7 
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0 
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50 


50 
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34 
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7 
0 
3 

37 



5 
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2 
0 
0 
16 



16 
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29 
3 
5 



3 
0 
41 



2 
0 

1 

2 
2 

25 



50 50 50 50 50 50 50 50 50 50 

50 50 50 50 50 50 50 50 50 50 

49 48 50 47 45 47 50 I 49 47 

50 50 50 50 50 50 50 50 50 50 



13.1 
0.3 
8.4 
0.1 

50 
37.8 
22.5 
17.6 
49.9 
33.8 
1.8 
26.5 
22 
4.3 
21.6 
6.2 
8.3 
8.1 
48.9 
10.9 
3.4 
47.2 
50 
6.3 
1.7 
6.9 
32.5 
50 
50 
47.3 
50 



a The number of times a feature was selected. The colored background highlights the cells that were selected either every time (dark blue), more 
than 90% (puple), or 80% (cyan) of the 50 trials in each fold, subcellular localization probabilities calculated with ConLoc. Number of trans- 
membrane helices calculated with TMHMM. 3 Condon usage freqeuncies calculated with CondonW. 4 Ratios derived from the results of CondonW. 
5 Number of organisms having an ortholog. 6 Protein-protein interaction network features. 7 Gene expression features. 
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fold, we applied three criteria for feature selection: no filtering 
(Entire), >80% (FS_80), and >90% (FS_90) of the 50 trials (see 
Table 1 and MATERIALS and METHODS). For each selected fea- 
ture subset, a linear discriminant function was trained by a ge- 
netic algorithm. As objective functions of the genetic algorithm, 
we employed two performance measures suitable for the task of 
essential gene prediction, i.e., partial AUC at 5% FPR (GA_ 
PAUC 0.05) and partial AUC at 10% FPR (GA PAUC 0.10) as 
well as AUC (GA_ AUC). Supplementary Fig. 1 shows perform- 
ance comparisons among different objective functions and fea- 
ture selection criteria. Supplementary Figs. 1 A and 1 B compare 
the performance, measured by partial AUC at 5% and 10% FPR, 
respectively. Regardless of the feature selection criteria, 
GA_PAUC_ 0.05 and GAPAUC0.10 performed significantly 
better than GA_AUC in terms of partial AUC at 5% and 10% 
FPR, respectively. Supplementary Fig. 1C shows that GA_AUC 
achieved much higher AUC values than GA_ PAUC 0.05 and 
GAPAUC0.10. From these results, we can conclude that the 
objective function used for genetic algorithms is highly influen- 
tial on their performance. Thus, the objective function should be 
matched to the performance measure. 

Our feature selection method improved the prediction perform- 
ance in general, and the performance of GA_PAUC_0.05 was sig- 
nificantly enhanced by feature selection (Supplementary Figs. 1 A 
and 1C). GA_PAUC_0. 1 0 showed performance enhancement by 
feature selection in terms of AUC (Supplementary Fig. 1C). 
However, its performance measured by partial AUC at 10% FPR 
was degraded by feature selection (Supplementary Fig. 1 B). The 
performance of GA_AUC was improved by feature selection 
when measured by partial AUC at 1 0% FPR (Supplementary Fig. 
1 B). When measured by partial AUC at 5% FPR and AUC, its per- 
formance did not show significant difference by the feature se- 
lection criteria (Supplementary Figs. 1A and 1C). We also ob- 
served that the variance in performance was reduced by feature 
selection (Entire vs. FS_80 and FS_90; see Supplementary Tables 
1, 2 and 3). By reducing the number of features, our feature se- 
lection method effectively enhances the generalization perform- 
ance of the genetic algorithm in many cases. 

Comparison with other classification methods: We compared 
our genetic algorithm-based methods (GA_PAUC_0.05, GA_ 
PAUC_0.10, and GA AUC) with other popular classification 
methods such as multilayer perceptrons (multi Layer), logistic re- 
gression (logistic), and support vector machines with RBF (radial 
basis function) and polynomial kernels (smoRBF and smoPoly, 
respectively) (Supplementary Fig. 2). All parameters for these 
methods were set as default in Weka (see MATERIALS and 
METHODS). Our genetic algorithm-based methods outperfor- 
med all the others except only one case, i.e., performance meas- 
ured by partial AUC at 5% FPR and using the entire feature set, 
where GA_PAUC_0.05 achieved the second highest perform- 
ance (Supplementary Fig. 2A). Furthermore, our methods 
showed lower variance in performance than others, regardless 
of the performance measure and feature selection criteria (see 
Supplementary Tables 1, 2, and 3). Thus, we conclude that the 



genetic algorithm-based methods are generally more useful than 
other classification methods for the task where the performance 
measure is partial AUC. Other methods have their own ob- 
jective functions. Multilayer perceptrons are usually trained by 
minimizing the squared error or maximizing the cross entropy 
function. Logistic regression is performed by maximizing the 
likelihood. Support vector machines are trained by maximizing 
the distance between the decision boundary and the closest data 
point. All these objective functions are calculated over all train- 
ing data examples. However, partial AUC is calculated over a 
subset of data examples, i.e., the examples classified as positive 
at a specific value of FPR. Thus, the optimal classifier with re- 
spect to these objective functions can achieve suboptimal partial 
AUC values. On the contrary, our methods are trained by di- 
rectly maximizing the performance measure, i.e., partial AUC 
and are likely to achieve better performance than others in terms 
of this measure. 

Feature selection had a different impact on each classification 
method. Our feature selection methods greatly improved the 
performance of multilayer perceptrons, regardless of the per- 
formance measure and the feature selection criteria (Supple- 
mentary Fig. 2). However, feature selection did not have a strong 
influence on the performance of other classification methods. In 
our experiments, multilayer perceptrons severely overfitted the 
training dataset of each fold (see Supplementary Tables 4, 5, and 
6). Overfitting can be reduced by eliminating less relevant 
features. This explains why feature selection improved the per- 
formance of multilayer perceptrons, but not the other methods. 

DISCUSSION 

The ultimate goal in developing computational methods for the 
prediction of essential genes is the predictability in under-stud- 
ied organisms based on a universal set of features that are devel- 
oped in well studied model organisms. The prediction system 
should be robust enough to be applicable to other distantly re- 
lated organisms. Deng et a/, investigated this issue in four dis- 
tantly related bacteria (1 2). In this report, we did not address the 
transferability issue, but focused on two core issues, feature se- 
lection and classification algorithms. 

For a newly sequenced genome, only the sequence-based fea- 
tures are usually available. However, with the advent of 
next-generation sequencing technology, gene expression profil- 
ing using RNA-seq technology can be performed concomitantly 
with or independent from genome sequencing. Considering that 
the gene expression properties are closely related to evolu- 
tionary conservation and essentiality, it is appropriate to include 
these features in the prediction models. Unlike gene expression 
profiles, high throughput protein-protein interaction data usu- 
ally require lengthy experiments. As gene coexpression has 
been observed between permanent, not transient, protein com- 
plex partners (20), the gene coexpression network properties 
may compensate protein-protein interaction network properties. 

The power to predict essential genes varies among various 
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types of features. Unlike most of the previous studies that meas- 
ured each feature's power independently, using statistical meas- 
ures, we employed the feature selection wrapper technique, 
which evaluated each feature based on classification models in- 
volving all the other features. As the measure of relevance to es- 
sentiality, we used the number of times a feature was selected. 
The majority of the most consistently selected features in our 
study had been previously reported as the most powerful pre- 
dictors by the other studies. It was possible to rationalize their se- 
lection in terms of biological knowledge. As such, feature se- 
lection in essential gene prediction is not only one of the essen- 
tial steps in machine learning but also has a strong biological 
foundation. 

While the computational prediction of biological effects can 
shed insight into what features play major roles, its results can 
have a bigger impact through experimental validations. It would 
then be better to suggest a candidate list that is short and de- 
pleted with false positives in order to minimize the labor-in- 
tensive and time-consuming wet experiments. In such cases, 
partial AUC, which is measured at fixed lower FPR, is then more 
important than full AUC. We implemented genetic algorithms to 
maximize directly the partial AUC. While other classification 
methods have their own objective functions, genetic algorithms 
allow the flexibility of formulating one's own objective 
functions. To our knowledge, this is the first implementation of 
genetic algorithms in maximizing partial AUC in biomedical 
applications. We believe that its flexibility and robustness may 
stimulate wide acceptance. 

MATERIALS AND METHODS 
Data preparation 

Sequence-based features: We downloaded 5885 ORF se- 
quences of S. cerevisiae from ftp://genome-ftp.stanford.edu. 
CodonW (http://codonw.sourceforge.net/) was used to calculate 
1 4 features related to the codon usage (see Supplementary Table 
7). ConLoc (http://sbi.postech.ac.kr/conloc/) was used to predict 
the subcellular localizations: cytosol, extracellular matrix, plas- 
ma membrane, mitochondria, and nucleus. The probabilities for 
these five locations were estimated and used as features, 
respectively. TMHMM Server (http://www.cbs.dtu.dk/ services/ 
TMHMM/) predicted the number of transmembrane helices of 
each of the 5825 amino acid sequences. Finally, ratios of rare 
amino acids and amino acids close to stop codons were calcu- 
lated as suggested in (4). 

Protein-protein interaction network features: We followed the 
protocol established by Hwang et al. (9). Briefly, a protein-pro- 
tein interaction data of S. cerevisiae was downloaded from DIP 
(version Scere20091230), which contained 5033 proteins and 
22,1 18 interactions. From this yeast protein-protein interaction 
network, we calculated the following topological properties as 
suggested in (9): degree (degreeK), clustering coefficient (CCo), 
betweenness centrality (BC), closeness centrality (CC), clique 
level (KL), and essentiality index (El). A list of essential genes 



from DEG (1 6) was used for calculating the El of 4662 genes. 

Gene expression features: We downloaded gene expression 
profiles of yeast obtained by Affymetrix Yeast Genome S98 
Array (9335 probes) from GEO (http://www.ncbi.nlm.nih.gov/ 
geo/) (15). Total number of samples was 2015. From the raw CEL 
files, the expression level of each probe was calculated by RMA 
(21). Then, we calculated mean and variance of expression of 
5885 genes as in (10). When multiple probes were mapped to 
one gene, we used average values. 

Phyletic retention features: We obtained phyletic retention 
data for 4728 genes of 5. cerevisiae from (5), where the phyletic 
retention was defined as the number of organisms having an 
ortholog. 

Final dataset: Finally, we obtained a dataset consisting of 
3979 genes and 31 features by excluding genes having a missing 
value. There were 940 essential genes. In our experiments, 
10-fold cross validation was used. 

Feature selection 

We employed a backward search-based wrapper for feature se- 
lection (22). Feature selection wrappers are known to produce 
better results than filter methods, such as information gain, be- 
cause they consider classification models involved. AUC (Area 
under the ROC Curve) with 3-fold cross validation was adopted 
as a performance measure for the wrapper. As a classification 
model for the wrapper, we used logistic regression. We applied 
the wrapper 50 times and selected features based on the number 
of times they have been selected. According to the result of the 
50 trials, we made three different feature groups: no filtering, 
more than 80% selected, and more than 90% selected. 

Partial AUC maximization by genetic algorithms 

The partial AUC (area under the ROC curve) is a sub-area of 
AUC measured over a range of false positive rates (see 
Supplementary Fig. 1). In order to maximize partial AUC on a 
training dataset, we used genetic algorithms. A linear discrim- 
inant function was used for scoring each essential gene candi- 
date, c, as follows. 

score(c) = w • f , 

where w is a weight vector and f denotes a set of feature values 
of c. Each feature value //was normalized into [0, 1] as follows: 

fi = (fi - min(f,)) / (max(fi) - min(/i)). 

A genetic algorithm was used for finding w which maximizes 
partial AUC. See Supplementary Table 8 for parameter setting of 
the genetic algorithm. In our experiments, we used partial AUCs 
at FPRs (false positive rates) of 5% and 1 0% as examples of ac- 
ceptable FPRs in subsequent validation. 

Other classification methods 

We used the Weka package (http://www.cs.waikato.ac.nz/ml/ 
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weka/) for comparing the performance of our GA-based methods 
with the following popular classification methods: multilayer 
perceptrons, logistic regression, and support vector machines. 
Default settings for these models in Weka were used. For multi- 
layer perceptrons, the number of hidden layers was one, and the 
number of hidden nodes was 16. The backpropagation algo- 
rithm was used for training multilayer perceptrons. A ridge esti- 
mator was used for the logistic regression model, and RBF (radial 
basis function) and polynomial kernels were used for support 
vector machines. 
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